SA Conference Paper
Audio Driven Real-Time Facial Animation for Social Telepresence
Lee, Jiye, Li, Chenghui, Tran, Linh, Wei, Shih-En, Saragih, Jason, Richard, Alexander, Joo, Hanbyul, Bai, Shaojie
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.
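The abstract's two latency ideas, a transformer that never looks at future audio and a denoiser distilled to a single step, can be illustrated with a minimal sketch. Everything below (module names, feature sizes, the rolling buffer, the one-step formulation) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a causal transformer encoder that maps
# streaming audio features to facial-expression latents without looking at
# future frames, plus a one-step "distilled" denoiser call.
import torch
import torch.nn as nn

class OnlineAudioEncoder(nn.Module):
    def __init__(self, audio_dim=80, latent_dim=128, n_heads=4, n_layers=2, window=32):
        super().__init__()
        self.window = window                      # only past frames are kept
        self.in_proj = nn.Linear(audio_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(latent_dim, latent_dim)

    def forward(self, audio_window):
        # audio_window: (B, T <= window, audio_dim), newest frame last.
        x = self.in_proj(audio_window)
        T = x.shape[1]
        # Causal mask: each frame attends only to itself and earlier frames.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)
        return self.out_proj(h[:, -1])            # latent for the newest frame only

def denoise_one_step(student, noise, cond):
    # Distillation collapses iterative denoising into a single forward pass:
    # the student directly predicts the clean expression latent.
    return student(torch.cat([noise, cond], dim=-1))

# Streaming usage: keep a rolling buffer of past audio frames per user.
encoder = OnlineAudioEncoder()
student = nn.Linear(128 + 128, 128)               # stand-in for the distilled denoiser
buffer = torch.zeros(1, 32, 80)                   # rolling window of audio features
new_frame = torch.randn(1, 1, 80)
buffer = torch.cat([buffer[:, 1:], new_frame], dim=1)
cond = encoder(buffer)
expr_latent = denoise_one_step(student, torch.randn(1, 128), cond)
```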
- Asia > South Korea > Seoul > Seoul (0.40)
- Asia > China > Hong Kong (0.07)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- (3 more...)
Low-Rank Adaptation of Neural Fields
Truong, Anh, Mahmoud, Ahmed H., Luković, Mina Konaković, Solomon, Justin
Processing visual data often involves small adjustments or sequences of changes, e.g., image filtering, surface smoothing, and animation. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields -- neural network parameterizations of visual or physical functions -- has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA for instance-specific neural fields, avoiding the need for large pre-trained models and yielding lightweight updates. We validate our approach with experiments in image filtering, geometry editing, video compression, and energy-based editing, demonstrating its effectiveness and versatility for representing neural field updates.
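A minimal sketch of the core mechanism, low-rank updates to a frozen coordinate-MLP field; the wrapper name, rank, and initialization below are assumptions rather than the paper's code.

```python
# Minimal sketch (assumed implementation, not the authors' code): a low-rank
# update W + (alpha/r) * B @ A applied to one linear layer of a pre-trained
# neural field MLP. Only A and B are trained when encoding an edit.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # frozen pre-trained field weights
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))   # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Wrap a tiny coordinate-MLP field and fine-tune only the LoRA factors
# to represent, e.g., a filtered version of the original image field.
field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
field[0] = LoRALinear(field[0])
field[2] = LoRALinear(field[2])
trainable = [p for p in field.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "parameters in the lightweight update")
```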
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > China > Hong Kong (0.07)
- (8 more...)
Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation
Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application-specific, and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (i.e., teammates). We employ a novel variation of Low-Rank Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore, Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis.
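One way the coordination-plus-adaptation idea could be wired, sketched under heavy assumptions: two "teammate" copies share a frozen base block, each carries its own low-rank adapter, and a naive mixing step exchanges features. The mixing rule and layer choice are guesses, not the Teamwork architecture.

```python
# Conceptual sketch only (assumed, heavily simplified): teammates handle different
# channel groups (e.g., RGB output and an extra SVBRDF output); each gets its own
# low-rank adapter on top of a shared frozen base block.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, h):
        return h + h @ self.A.t() @ self.B.t()

base = nn.Linear(64, 64)                  # stands in for a frozen pretrained block
for p in base.parameters():
    p.requires_grad_(False)

teammates = nn.ModuleList([Adapter(64) for _ in range(2)])

def teamwork_block(features):
    # features: list of per-teammate hidden states, one per channel group
    hs = [adapter(base(f)) for adapter, f in zip(teammates, features)]
    shared = torch.stack(hs).mean(dim=0)  # naive coordination: share the mean
    return [h + shared for h in hs]

out_rgb, out_brdf = teamwork_block([torch.randn(1, 64), torch.randn(1, 64)])
```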
- Asia > China > Hong Kong (0.07)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Physics-Based Motion Imitation with Adversarial Differential Discriminators
Zhang, Ziyu, Bashkirov, Sergey, Yang, Dun, Shi, Yi, Taylor, Michael, Peng, Xue Bin
Multi-objective optimization problems, which require the simultaneous optimization of multiple objectives, are prevalent across numerous applications. Existing multi-objective optimization methods often rely on manually-tuned aggregation functions to formulate a joint optimization objective. The performance of such hand-tuned methods is heavily dependent on careful weight selection, a time-consuming and laborious process. These limitations also arise in the setting of reinforcement-learning-based motion tracking methods for physically simulated characters, where intricately crafted reward functions are typically used to achieve high-fidelity results. Such solutions not only require domain expertise and significant manual tuning, but also limit the applicability of the resulting reward function across diverse skills. To bridge this gap, we present a novel adversarial multi-objective optimization technique that is broadly applicable to a range of multi-objective reinforcement-learning tasks, including motion tracking. Our proposed Adversarial Differential Discriminator (ADD) receives a single positive sample, yet is still effective at guiding the optimization process. We demonstrate that our technique can enable characters to closely replicate a variety of acrobatic and agile behaviors, achieving comparable quality to state-of-the-art motion-tracking methods, without relying on manually-designed reward functions. Code and results are available at https://add-moo.github.io/.
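An illustrative reading of the single-positive-sample idea (not the paper's training code): a discriminator scores the difference between reference and simulated features, the zero vector serves as the lone positive sample, and the tracking reward comes from the discriminator's output. Network sizes and reward shaping below are assumptions.

```python
# Illustrative sketch under assumptions: train a discriminator on the *difference*
# between reference and simulated features, with the zero vector as the single
# positive sample, and derive a tracking reward from its output.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_update(ref_feats, sim_feats):
    diff = ref_feats - sim_feats                          # negative samples
    positive = torch.zeros_like(diff[:1])                 # the single positive sample
    logits = disc(torch.cat([positive, diff], dim=0))
    labels = torch.cat([torch.ones(1, 1), torch.zeros(diff.shape[0], 1)], dim=0)
    loss = bce(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def tracking_reward(ref_feats, sim_feats):
    # Higher reward the closer the difference looks to the positive (zero) sample.
    with torch.no_grad():
        return torch.sigmoid(disc(ref_feats - sim_feats)).squeeze(-1)

loss = discriminator_update(torch.randn(16, 32), torch.randn(16, 32))
r = tracking_reward(torch.randn(4, 32), torch.randn(4, 32))
```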
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > China > Hong Kong (0.07)
- North America > United States > New York > New York County > New York City (0.04)
- (7 more...)
AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Shi, Haoyuan, Li, Yunxin, Chen, Xinyu, Wang, Longyue, Hu, Baotian, Zhang, Min
Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
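A toy, greedy simplification of the candidate-selection loop (the actual MCTS-Gen search and AniEval scorer are more involved): per shot, sample a few candidate clips, score each in the context of the previously chosen clip, and keep the best. generate_clip and review_clip are hypothetical stand-ins.

```python
# Toy sketch, greatly simplified from the MCTS-inspired strategy: multi-candidate
# clip generation with context-aware selection under a fixed per-shot budget.
import random

def generate_clip(shot_prompt):
    # Hypothetical stand-in for the Photography Agent's video generator.
    return {"prompt": shot_prompt, "seed": random.randrange(10**6)}

def review_clip(clip, prev_clip):
    # Hypothetical stand-in for AniEval-style scoring: consistency with the
    # preceding clip, action completion, visual quality, etc.
    return random.random()

def select_clips(shot_prompts, budget_per_shot=3):
    chosen, prev = [], None
    for prompt in shot_prompts:
        candidates = [generate_clip(prompt) for _ in range(budget_per_shot)]
        scored = [(review_clip(c, prev), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        chosen.append(best)
        prev = best
    return chosen

storyboard = ["hero wakes up", "hero meets the fox", "they cross the river"]
clips = select_clips(storyboard)
```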
- Asia > China > Guangdong Province > Shenzhen (0.77)
- Asia > China > Heilongjiang Province > Harbin (0.77)
- Asia > China > Hong Kong (0.08)
- (2 more...)
- Research Report (0.64)
- Workflow (0.46)
- Information Technology > Graphics (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Light-SQ: Structure-aware Shape Abstraction with Superquadrics for Generated Meshes
Wang, Yuhan, Chen, Weikai, Hu, Zeyu, Zhang, Runze, Yin, Yingda, Wu, Ruoyu, Luo, Keyang, Qian, Shengju, Ma, Yiyan, Li, Hongyi, Gao, Yuan, Zhou, Yuhuan, Luo, Hao, Wang, Wan, Shen, Xiaobin, Li, Zhaowei, Zhu, Kuixin, Hong, Chuanlang, Wang, Yueyue, Feng, Lijie, Wang, Xin, Loy, Chen Change
In user-generated-content (UGC) applications, non-expert users often rely on image-to-3D generative models to create 3D assets. In this context, primitive-based shape abstraction offers a promising solution for UGC scenarios by compressing high-resolution meshes into compact, editable representations. Effective shape abstraction must therefore be structure-aware, characterized by low overlap between primitives, part-aware alignment, and primitive compactness. We present Light-SQ, a novel superquadric-based optimization framework that explicitly emphasizes structure-awareness from three aspects. (a) We introduce SDF carving to iteratively update the target signed distance field, discouraging overlap between primitives. (b) We propose a block-regrow-fill strategy guided by structure-aware volumetric decomposition, enabling structural partitioning to drive primitive placement. (c) We implement adaptive residual pruning based on SDF update history to suppress over-segmentation and ensure compact results. In addition, Light-SQ supports multiscale fitting, enabling localized refinement to preserve fine geometric details. To evaluate our method, we introduce 3DGen-Prim, a benchmark extending 3DGen-Bench with new metrics for both reconstruction quality and primitive-level editability. Extensive experiments demonstrate that Light-SQ enables efficient, high-fidelity, and editable shape abstraction with superquadrics for complex generated geometry, advancing the feasibility of 3D UGC creation.
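For reference, the standard superquadric inside-outside function and a crude carving step, sketched under assumptions; parameter names follow the usual superquadric convention, and the carving rule is a simplification of the SDF update described above.

```python
# Minimal sketch (assumed notation): the superquadric inside-outside function and
# a crude "carving" step that removes already-covered regions from a sampled SDF
# so the next primitive is fit to the residual geometry.
import numpy as np

def superquadric_F(p, scale=(1.0, 1.0, 1.0), eps=(1.0, 1.0)):
    # p: (N, 3) points in the primitive's local frame.
    # F < 1 inside, F = 1 on the surface, F > 1 outside.
    a1, a2, a3 = scale
    e1, e2 = eps
    x, y, z = np.abs(p[:, 0] / a1), np.abs(p[:, 1] / a2), np.abs(p[:, 2] / a3)
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)

def carve_sdf(sdf_values, points, fitted_scale, fitted_eps):
    # Points already inside the fitted primitive are pushed "outside" the target
    # field so subsequent primitives are not attracted to covered regions.
    inside = superquadric_F(points, fitted_scale, fitted_eps) < 1.0
    carved = sdf_values.copy()
    carved[inside] = np.maximum(carved[inside], 0.0)
    return carved

pts = np.random.uniform(-1, 1, size=(1000, 3))
sdf = np.linalg.norm(pts, axis=1) - 0.8           # toy target: a sphere of radius 0.8
sdf_after = carve_sdf(sdf, pts, (0.5, 0.5, 0.5), (1.0, 1.0))
```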
- Asia > China > Hong Kong (0.07)
- Asia > Singapore (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
LVT: Large-Scale Scene Reconstruction via Local View Transformers
Imtiaz, Tooba, Chai, Lucy, Heal, Kathryn, Luo, Xuan, Park, Jungyeon, Dy, Jennifer, Flynn, John
Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer's well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes both color and opacity view-dependence. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass. See our project page for results and interactive demos: https://toobaimt.github.io/lvt/
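A rough sketch of local-neighborhood attention with relative-pose conditioning, under assumptions; the pose embedding, neighbor count, and how the pose enters the keys and values are simplified guesses rather than the LVT design.

```python
# Rough sketch under assumptions: each view's tokens attend only to tokens from
# its k nearest neighboring views, and the attention is conditioned on the
# relative camera transform between the query view and each neighbor.
import torch
import torch.nn as nn
import torch.nn.functional as F

def k_nearest_views(cam_positions, k):
    d = torch.cdist(cam_positions, cam_positions)      # (V, V) pairwise distances
    return d.topk(k + 1, largest=False).indices[:, 1:] # drop self

class LocalViewAttention(nn.Module):
    def __init__(self, dim=64, pose_dim=12):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim + pose_dim, 2 * dim)    # keys/values see the relative pose

    def forward(self, tokens, rel_poses, neighbors):
        # tokens: (V, T, dim); rel_poses: (V, V, pose_dim); neighbors: (V, k)
        out = []
        for v in range(tokens.shape[0]):
            nbr = neighbors[v]
            ctx = tokens[nbr].flatten(0, 1)                               # (k*T, dim)
            pose = rel_poses[v, nbr].unsqueeze(1).expand(-1, tokens.shape[1], -1).flatten(0, 1)
            k, val = self.kv(torch.cat([ctx, pose], dim=-1)).chunk(2, dim=-1)
            q = self.q(tokens[v])                                         # (T, dim)
            attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
            out.append(attn @ val)
        return torch.stack(out)                                           # (V, T, dim)

views, T = 6, 8
tokens = torch.randn(views, T, 64)
cams = torch.randn(views, 3)
rel = torch.randn(views, views, 12)            # flattened relative 3x4 transforms (toy)
nbrs = k_nearest_views(cams, k=2)
updated = LocalViewAttention()(tokens, rel, nbrs)
```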
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.07)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- (2 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data
Mu, Yuxuan, Ling, Hung Yu, Shi, Yi, Ojeda, Ismael Baira, Xi, Pengcheng, Shu, Chang, Zinno, Fabio, Peng, Xue Bin
Motion capture (mocap) data often exhibits visually jarring artifacts due to inaccurate sensors and post-processing. Cleaning this corrupted data can require substantial manual effort from human experts, which can be a costly and time-consuming process. Previous data-driven motion cleanup methods offer the promise of automating this cleanup process, but often require in-domain paired corrupted-to-clean training data. Constructing such paired datasets requires access to high-quality, relatively artifact-free motion clips, which often necessitates laborious manual cleanup. In this work, we present StableMotion, a simple yet effective method for training motion cleanup models directly from unpaired corrupted datasets that need cleanup. The core component of our method is the introduction of motion quality indicators, which can be easily annotated - through manual labeling or heuristic algorithms - and enable training of quality-aware motion generation models on raw motion data with mixed quality. At test time, the model can be prompted to generate high-quality motions using the quality indicators. Our method can be implemented through a simple diffusion-based framework, leading to a unified motion generate-discriminate model, which can be used to both identify and fix corrupted frames. We demonstrate that our proposed method is effective for training motion cleanup models on raw mocap data in production scenarios by applying StableMotion to SoccerMocap, a 245-hour soccer mocap dataset containing real-world motion artifacts. The trained model effectively corrects a wide range of motion artifacts, reducing motion pops and frozen frames by 68% and 81%, respectively. Results and code are available at https://yxmu.foo/stablemotion-page
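The quality-indicator conditioning can be illustrated with a toy regressor (the paper itself uses a diffusion framework): train on raw frames tagged with per-frame quality flags, then prompt with "high quality" at test time. Dimensions and the loss below are assumptions.

```python
# Simplified sketch (assumed): a quality-aware motion model trained on raw mocap
# with per-frame quality indicators; at test time it is prompted with quality=1.
import torch
import torch.nn as nn

motion_dim = 63                      # e.g., 21 joints x 3, an illustrative choice
model = nn.Sequential(nn.Linear(motion_dim + 1, 256), nn.ReLU(),
                      nn.Linear(256, motion_dim))

def training_step(frames, quality_flags, optimizer):
    # frames: (B, motion_dim) raw mocap; quality_flags: (B, 1) in {0, 1}
    inp = torch.cat([frames, quality_flags], dim=-1)
    loss = ((model(inp) - frames) ** 2).mean()   # learn to reproduce data *as labeled*
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def cleanup(frames):
    # Prompt with quality=1 so the model maps corrupted frames toward the
    # high-quality mode it learned from the good portion of the data.
    flags = torch.ones(frames.shape[0], 1)
    with torch.no_grad():
        return model(torch.cat([frames, flags], dim=-1))

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
raw = torch.randn(32, motion_dim)
flags = (torch.rand(32, 1) > 0.3).float()        # heuristic or manual quality labels
training_step(raw, flags, opt)
cleaned = cleanup(torch.randn(4, motion_dim))
```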
- North America > Canada (1.00)
- Asia > China > Hong Kong (0.07)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Leisure & Entertainment > Games > Computer Games (0.64)
- Leisure & Entertainment > Sports (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
- Information Technology > Artificial Intelligence > Vision > Video Understanding (0.49)
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Xu, Zhenran, Wang, Yiyu, Yang, Xue, Wang, Longyue, Luo, Weihua, Zhang, Kaifu, Hu, Baotian, Zhang, Min
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
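A hedged sketch of what a rule-metric hybrid reward could look like: a format-validity check plus node-level F1 against a reference workflow. The JSON schema and weights are illustrative assumptions, not the paper's exact reward.

```python
# Hedged sketch (not the paper's exact reward): combine a format-validity check
# (does the output parse as a node graph?) with node-level F1 against a reference.
import json

def node_f1(pred_nodes, ref_nodes):
    pred, ref = set(pred_nodes), set(ref_nodes)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def hybrid_reward(generated_text, ref_nodes, w_format=0.3, w_nodes=0.7):
    try:
        workflow = json.loads(generated_text)
        nodes = [n.get("class_type", "") for n in workflow.get("nodes", [])]
    except (json.JSONDecodeError, AttributeError):
        return 0.0                                  # invalid format gets no reward
    return w_format * 1.0 + w_nodes * node_f1(nodes, ref_nodes)

sample = '{"nodes": [{"class_type": "CheckpointLoaderSimple"}, {"class_type": "KSampler"}]}'
print(hybrid_reward(sample, ["CheckpointLoaderSimple", "KSampler", "VAEDecode"]))
```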
- Asia > China > Hong Kong (0.05)
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Asia > China > Heilongjiang Province > Harbin (0.05)
- (4 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.87)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)
Boosting 3D Object Generation through PBR Materials
Wang, Yitong, Xu, Xudong, Ma, Li, Wang, Haoran, Dai, Bo
Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, the film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Because these methods consider only textures rather than full materials, they encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. They also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.
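Why these four channels matter for relighting can be shown with a standard single-light metallic-roughness evaluation (GGX plus Schlick Fresnel); this is textbook real-time shading, not anything specific to the paper, and the bump map is assumed to have already perturbed the normal.

```python
# Standard Cook-Torrance metallic-roughness evaluation consuming albedo,
# roughness, metalness, and a (bump-perturbed) normal for one point light.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def shade(albedo, roughness, metalness, normal, view_dir, light_dir, light_color):
    n, v, l = normalize(normal), normalize(view_dir), normalize(light_dir)
    h = normalize(v + l)
    ndl, ndv = max(np.dot(n, l), 1e-4), max(np.dot(n, v), 1e-4)
    ndh, vdh = max(np.dot(n, h), 0.0), max(np.dot(v, h), 0.0)

    alpha = roughness ** 2
    d = alpha ** 2 / (np.pi * ((ndh ** 2) * (alpha ** 2 - 1) + 1) ** 2)   # GGX NDF
    k = (roughness + 1) ** 2 / 8                                          # Schlick-GGX
    g = (ndl / (ndl * (1 - k) + k)) * (ndv / (ndv * (1 - k) + k))
    f0 = 0.04 * (1 - metalness) + albedo * metalness                      # base reflectance
    f = f0 + (1 - f0) * (1 - vdh) ** 5                                    # Fresnel

    specular = d * g * f / (4 * ndl * ndv)
    diffuse = (1 - metalness) * albedo / np.pi
    return (diffuse + specular) * light_color * ndl

color = shade(albedo=np.array([0.8, 0.2, 0.2]), roughness=0.4, metalness=0.0,
              normal=np.array([0.0, 0.0, 1.0]),       # would come from the bump map
              view_dir=np.array([0.0, 0.0, 1.0]),
              light_dir=np.array([0.3, 0.3, 1.0]),
              light_color=np.array([1.0, 1.0, 1.0]))
```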
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.16)
- Asia > China > Shanghai > Shanghai (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Leisure & Entertainment (0.68)
- Media > Film (0.54)